CIRCUIT AND METHOD FOR PERFORMING A TWO-DIMENSIONAL 
TRANSFORM DURING THE PROCESSING OF AN IMAGE 



Technical Field: 

The invention relates generally to image processing circuits and techniques, 
and more particularly to a circuit and method for performing a two-dimensional 
transform, such as an Inverse-Discrete-Cosine-Transform (IDCT), during the 
processing of an image. Such a circuit and method can perform an IDCT more 
efficiently than prior circuits and methods. 

Background of the Invention: 

It is often desirable to decrease the complexity of an image processor that 
compresses or decompresses image data. Because image data is often arranged in 
two-dimensional (2-D) blocks, the processor often executes 2-D mathematical 
functions to process the image data. Unfortunately, a processor having a relatively 
complex architecture is typically required to execute these complex image- 
processing functions. The complex architecture often increases the size of the 
processor's arithmetic unit and its internal data busses, and thus often increases the 
cost and overall size of the processor as compared to standard processors. 

One technique for effectively reducing the complexity of an image processor's 
architecture is to break down the complex image-processing functions into a series 
of simpler functions that a simpler architecture can handle. For example, a paper by 
Masaki et al., which is incorporated by reference, discloses a technique for breaking 
down an 8-point vector multiplication into a series of 4-point vector multiplications to
simplify a 2-D IDCT. VLSI Implementation of Inversed Discrete Cosine Transformer
and Motion Compensator for MPEG2 HDTV Video Decoding, IEEE Transactions On 
Circuits And Systems For Video Technology, Vol. 5, No. 5, October, 1995. 

Unfortunately, although such a technique allows the processor to have a 
simpler architecture, it often increases the time that the processor needs to process 
the image data. Thus, the general rule is that the simpler the processor's 
architecture, the slower the processing time, and the more complex the processor's 
architecture, the faster the processing time.

To help the reader understand the concepts discussed above and those
discussed below in the Description of the Invention, following is a basic overview of
conventional image compression/decompression techniques, the 2-D DCT function
and the 2-D and 1-D IDCT functions, and a discussion of Masaki's technique for
simplifying the 1-D IDCT function.

Overview of Conventional Image-Compression/Decompression Techniques 

To electronically transmit a relatively high-resolution image over a relatively
low-bandwidth channel, or to electronically store such an image in a relatively small
memory space, it is often necessary to compress the digital data that represent the
image. Such image compression typically involves reducing the number of data bits
that are necessary to represent an image. For example, High-Definition-Television
(HDTV) video images are compressed to allow their transmission over existing
television channels. Without compression, HDTV video images would require
transmission channels having bandwidths much greater than the bandwidths of
existing television channels. Furthermore, to reduce data traffic and transmission
time to acceptable levels, one may compress an image before sending it over the
Internet. Or, to increase the image-storage capacity of a CD-ROM or server, one
may compress an image before storing it.

Referring to Figures 1A - 6, the basics of the popular block-based Moving
Pictures Experts Group (MPEG) compression standards, which include MPEG-1 and
MPEG-2, are discussed. For purposes of illustration, the discussion is based on
using an MPEG 4:2:0 format to compress video images represented in a Y-Cb-Cr
color space. However, the discussed concepts also apply to other MPEG formats, to
images that are represented in other color spaces, and to other block-based
compression standards such as the Joint Photographic Experts Group (JPEG)
standard, which is often used to compress still images. Furthermore, although many
details of the MPEG standards and the Y-Cb-Cr color space are omitted for brevity,
these details are well known and are disclosed in a large number of available
references.

Referring to Figures 1A - 1D, the MPEG standards are often used to
compress temporal sequences of images (video frames for purposes of this
discussion) such as found in a television broadcast. Each video frame is divided
into subregions called macro blocks, which each include one or more pixels. Figure
1A is a 16-pixel-by-16-pixel macro block 10 having 256 pixels 12 (not drawn to
scale). The macro block 10 may have other dimensions as well. In the original
video frame, i.e., the frame before compression, each pixel 12 has a respective
luminance value Y and a respective pair of color-, i.e., chroma-, difference values Cb
and Cr ("B" indicates "Blue" and "R" indicates "Red").

Before compression of the video frame, the digital luminance (Y) and
chroma-difference (Cb and Cr) values that will be used for compression, i.e., the
pre-compression values, are generated from the original Y, Cb, and Cr values of the
original frame. In the MPEG 4:2:0 format, the pre-compression Y values are the
same as the original Y values. Thus, each pixel 12 merely retains its original
luminance value Y. But to reduce the amount of data to be compressed, the MPEG
4:2:0 format allows only one pre-compression Cb value and one pre-compression Cr
value for each group 14 of four pixels 12. Each of these pre-compression Cb and Cr
values is respectively derived from the original Cb and Cr values of the four pixels
12 in the respective group 14. For example, a pre-compression Cb value may equal
the average of the original Cb values of the four pixels 12 in the respective group 14.
Thus, referring to Figures 1B - 1D, the pre-compression Y, Cb, and Cr values
generated for the macro block 10 are arranged as one 16 x 16 matrix 16 of
pre-compression Y values (equal to the original Y values of the pixels 12), one 8 x 8
matrix 18 of pre-compression Cb values (equal to one derived Cb value for each
group 14 of four pixels 12), and one 8 x 8 matrix 20 of pre-compression Cr values
(equal to one derived Cr value for each group 14 of four pixels 12). The matrices 16,
18, and 20 are often called "blocks" of values. Furthermore, because the MPEG
standard requires one to perform the compression transforms on 8 x 8 blocks of pixel
values instead of on 16 x 16 blocks, the block 16 of pre-compression Y values is
subdivided into four 8 x 8 blocks 22a - 22d, which respectively correspond to the
8 x 8 pixel blocks A - D in the macro block 10. Thus, referring to Figures 1A - 1D, six
8 x 8 blocks of pre-compression pixel data are generated for each macro block 10: four
8 x 8 blocks 22a - 22d of pre-compression Y values, one 8 x 8 block 18 of
pre-compression Cb values, and one 8 x 8 block 20 of pre-compression Cr values.
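
As an illustration of the arrangement just described, the following minimal Python sketch splits one 16 x 16 macro block into the six 8 x 8 pre-compression blocks. It assumes that each group 14 is a 2 x 2 square of pixels, that the derived chroma value is the plain average (the example the text gives), and that blocks A - D are ordered top-left, top-right, bottom-left, bottom-right. The function name and the input layout are hypothetical.

```python
def split_macro_block(y, cb, cr):
    """Return (four 8x8 Y blocks, 8x8 Cb block, 8x8 Cr block) for one macro block.

    y, cb, cr are 16x16 lists of per-pixel values (hypothetical input layout).
    """
    def average_groups(plane):
        # One derived value per 2x2 group of pixels (assumed group shape).
        return [[(plane[2 * r][2 * c] + plane[2 * r][2 * c + 1] +
                  plane[2 * r + 1][2 * c] + plane[2 * r + 1][2 * c + 1]) / 4
                 for c in range(8)] for r in range(8)]

    def sub_block(plane, top, left):
        return [row[left:left + 8] for row in plane[top:top + 8]]

    # Assumed order of blocks A - D: top-left, top-right, bottom-left, bottom-right.
    y_blocks = [sub_block(y, top, left) for top in (0, 8) for left in (0, 8)]
    return y_blocks, average_groups(cb), average_groups(cr)


# Example input: synthetic pixel values, not real image data.
y = [[r * 16 + c for c in range(16)] for r in range(16)]
cb = [[r + c for c in range(16)] for r in range(16)]
cr = [[128] * 16 for _ in range(16)]

y_blocks, cb_block, cr_block = split_macro_block(y, cb, cr)
blocks = y_blocks + [cb_block, cr_block]   # six 8x8 blocks per macro block
```

The 256 luminance values are kept unchanged, while each chroma plane shrinks from 256 values to 64, so the macro block carries 384 values instead of 768.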

Figure 2 is a block diagram of an MPEG compressor 30, which is more
commonly called an encoder. Generally, the encoder 30 converts the
pre-compression data for a frame or sequence of frames into encoded data that
represent the same frame or frames with significantly fewer data bits than the
pre-compression data. To perform this conversion, the encoder 30 reduces or eliminates
redundancies in the pre-compression data and reformats the remaining data using
efficient transform and coding techniques.

More specifically, the encoder 30 includes a frame-reorder buffer 32, which
receives the pre-compression data for a sequence of one or more video frames and
reorders the frames in an appropriate sequence for encoding. Typically, the
reordered sequence is different than the sequence in which the frames are
generated and will be displayed. The encoder 30 assigns each of the stored frames
to a respective group, called a Group Of Pictures (GOP), and labels each frame as
either an intra (I) frame or a non-intra (non-I) frame. For example, each GOP may
include three I frames and twelve non-I frames for a total of fifteen frames. The
encoder 30 always encodes the macro blocks of an I frame without reference to
another frame, but can and often does encode the macro blocks of a non-I frame
with reference to one or more of the other frames in the GOP. The encoder 30 does
not, however, encode the macro blocks of a non-I frame with reference to a frame in
a different GOP.

Referring to Figures 2 and 3, during the encoding of an I frame, the 8 x 8
blocks (Figures 1B - 1D) of the pre-compression Y, Cb, and Cr values that represent
the I frame pass through a summer 34 to a Discrete Cosine Transformer (DCT) 36,
which transforms these blocks of pixel values into respective 8 x 8 blocks of one DC
(zero frequency) transform value D00 and sixty-three AC (non-zero frequency)
transform values D01 - D77. Referring to Figure 3, these DCT transform values are
arranged in an 8 x 8 transform block 37, which corresponds to a block of
pre-compression pixel values such as one of the pre-compression blocks of Figures
1B - 1D. For example, the block 37 may include the luminance transform values
DY00 - DY77 that correspond to the pre-compression luminance values
Y(0, 0)a - Y(7, 7)a in the pre-compression block 22a of Figure 1B. Furthermore, the
pre-compression Y, Cb, and Cr values pass through the summer 34 without being
summed with any other values because the encoder 30 does not use the summer 34
for encoding an I frame. As discussed below, however, the encoder 30 uses the
summer 34 for motion encoding macro blocks of a non-I frame.

Referring to Figures 2 and 4, a quantizer and zigzag scanner 38 limits each of
the transform values D from the DCT 36 to a respective maximum value, and
provides the quantized AC and DC transform values on respective paths 40 and 42
in a zigzag pattern. Figure 4 is an example of a zigzag scan pattern 43, which the
quantizer and zigzag scanner 38 may implement. Specifically, the quantizer and
zigzag scanner 38 provides the transform values D from the transform block 37
(Figure 3) on the respective paths 40 and 42 in the order indicated. That is, the
quantizer and scanner 38 first provides the transform value D in the "0" position, i.e.,
D00, on the path 42. Next, the quantizer and scanner 38 provides the transform
value D in the "1" position, i.e., D01, on the path 40. Then, the quantizer and scanner
38 provides the transform value D in the "2" position, i.e., D10, on the path 40, and so
on until at last it provides the transform value D in the "63" position, i.e., D77, on the
path 40. Such a zigzag scan pattern decreases the number of bits needed to
represent the encoded image data, and thus increases the coding efficiency of the
encoder 30. Although a specific zigzag scan pattern is discussed, the quantizer and
scanner 38 may scan the transform values using other scan patterns depending on
the coding technique and the type of images being encoded.
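
The scan order described above can be sketched in a few lines of Python. The sketch assumes the common zigzag order used by block-based coders, which agrees with the positions named in the text (position 0 is D00, position 1 is D01, position 2 is D10, position 63 is D77); Figure 4 itself is not reproduced here, and the function names are hypothetical.

```python
def zigzag_order(n=8):
    """Return the (row, column) pairs of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):            # s = row + column selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)         # even diagonals run from bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order


def zigzag_scan(block):
    """Split a block into its DC value (path 42) and its 63 AC values (path 40)."""
    sequence = [block[r][c] for r, c in zigzag_order(len(block))]
    return sequence[0], sequence[1:]


order = zigzag_order()
labels = [[f"D{r}{c}" for c in range(8)] for r in range(8)]
dc, ac = zigzag_scan(labels)
```

Because quantized high-frequency values are often zero, scanning from low to high frequency tends to group the zeros at the end of the sequence, which is what makes the subsequent variable-length coding efficient.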

Referring again to Figure 2, a prediction encoder 44 predictively encodes the
DC transform values, and a variable-length coder 46 converts the quantized AC
transform values and the quantized and predictively encoded DC transform values
into variable-length codes such as Huffman codes. These codes form the encoded
data that represent the pixel values of the encoded I frame.
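
The text does not say how the prediction encoder 44 forms its prediction. A minimal sketch, assuming the common approach of coding each DC value as its difference from the DC value of the previous block, is shown below. The function names and the initial predictor of 0 are assumptions made for illustration.

```python
def predict_encode(dc_values, initial_predictor=0):
    """Replace each DC value with its difference from the previous DC value."""
    differences, previous = [], initial_predictor
    for dc in dc_values:
        differences.append(dc - previous)
        previous = dc
    return differences


def predict_decode(differences, initial_predictor=0):
    """Rebuild the DC values by accumulating the differences."""
    values, previous = [], initial_predictor
    for difference in differences:
        previous += difference
        values.append(previous)
    return values


dc = [800, 808, 804, 804, 790]          # made-up DC values of neighboring blocks
differences = predict_encode(dc)         # [800, 8, -4, 0, -14]
recovered = predict_decode(differences)
```

Neighboring blocks usually have similar average brightness, so the differences are small numbers that a variable-length coder can represent with short codes.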

A transmit buffer 48 temporarily stores these codes to allow synchronized
transmission of the encoded data to a decoder (discussed below in conjunction with
Figure 5). Alternatively, if the encoded data is to be stored instead of transmitted,
the coder 46 may provide the variable-length codes directly to a storage medium
such as a CD-ROM.

A rate controller 50 ensures that the transmit buffer 48, which typically
transmits the encoded frame data at a fixed rate, never overflows or empties, i.e.,
underflows. If either of these conditions occurs, errors may be introduced into the
encoded data stream. For example, if the buffer 48 overflows, data from the coder
46 is lost. Thus, the rate controller 50 uses feedback to adjust the quantization
scaling factors used by the quantizer and zigzag scanner 38 based on the degree of
fullness of the transmit buffer 48. Specifically, the fuller the buffer 48, the larger the
controller 50 makes the scale factors, and the fewer data bits the coder 46
generates. Conversely, the more empty the buffer 48, the smaller the controller 50
makes the scale factors, and the more data bits the coder 46 generates. This
continuous adjustment ensures that the buffer 48 neither overflows nor underflows.
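
The feedback rule described above can be pictured with a very small Python sketch. The linear mapping, the function name, and the scale-factor range of 1 to 31 are assumptions chosen for illustration; the text does not specify the control law that the rate controller 50 actually uses.

```python
def next_scale_factor(fullness, min_scale=1, max_scale=31):
    """Map buffer fullness in [0, 1] to a quantization scale factor.

    Hypothetical linear rule: a fuller buffer gives a larger scale factor,
    so the coder generates fewer bits; an emptier buffer gives a smaller one.
    """
    fullness = min(1.0, max(0.0, fullness))
    return round(min_scale + fullness * (max_scale - min_scale))


# Scale factors for a buffer that is 0%, 10%, ..., 100% full.
scales = [next_scale_factor(step / 10) for step in range(11)]
```

A practical controller would also take the target bit rate and the frame type into account; this sketch only shows the direction of the adjustment.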

Still referring to Figure 2, the encoder 30 uses a dequantizer and inverse
zigzag scanner 52, an inverse DCT 54, a summer 56, a reference frame buffer 58,
and a motion predictor 60 to motion encode macro blocks of non-I frames.

Figure 5 is a block diagram of a conventional MPEG decompressor 62, which
is commonly called a decoder and which can decode frames that are encoded by the 
encoder 30 of Figure 2. 

Referring to Figures 5 and 6, for I frames and macro blocks of non-I frames
that are not motion predicted, a variable-length decoder 64 decodes the
variable-length codes received from the encoder 30. A prediction decoder 66 decodes the
predictively encoded DC transform values, and a dequantizer and inverse zigzag
scanner 67, which is similar or identical to the dequantizer and inverse scanner 52 of
Figure 2, dequantizes and rearranges the decoded AC and DC transform values. An
inverse DCT 68, which is similar or identical to the inverse DCT 54 of Figure 2,
transforms the dequantized transform values into inverse transform (IDCT) values,
i.e., recovered pixel values. Figure 6 is an 8 x 8 inverse-transform block 70 of
inverse transform values I00 - I77, which the inverse DCT 68 generates from the
block 37 of transform values D00 - D77 (Figure 3). For example, if the block 37
corresponds to the block 22a of pre-compression luminance values YA (Figure 1B),
then the inverse transform values I00 - I77 are the decoded luminance values for the
pixels in the 8 x 8 block A (Figure 1A). But because of the information losses that
quantization and dequantization cause, the inverse transform values I are often
different than the respective pre-compression pixel values they represent.
Fortunately, these losses are typically too small to cause visible degradation to a
decoded video frame.
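
The loss mentioned above can be seen in a toy round trip. The sketch below assumes a uniform quantizer that divides by a step size and rounds, which is a common model and not necessarily how the quantizer 38 and the dequantizer 67 operate; the step size and the sample values are made up.

```python
def quantize(value, step):
    """Map a transform value to an integer level (assumed uniform quantizer)."""
    return round(value / step)


def dequantize(level, step):
    """Map a level back to an approximate transform value."""
    return level * step


step = 8
original = [800, 37, -13, 5, 2]                       # made-up transform values
levels = [quantize(v, step) for v in original]        # what would be coded
recovered = [dequantize(q, step) for q in levels]     # what the decoder sees
errors = [r - v for r, v in zip(recovered, original)]
```

Each recovered value differs from the original by at most half a step, and these small transform-domain errors are what make the inverse transform values differ from the pre-compression pixel values.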

Still referring to Figure 5, the decoded pixel values from the inverse DCT 68
pass through a summer 72 (used during the decoding of motion-predicted macro
blocks of non-I frames as discussed below) into a frame-reorder buffer 74, which
stores the decoded frames and arranges them in a proper order for display on a
video display unit 76. If a decoded frame is also used as a reference frame for
purposes of motion decoding, then the decoded frame is also stored in the
reference-frame buffer 78.

The decoder 62 uses a motion interpolator 80, the prediction decoder 66,
and the reference-frame buffer 78 to decode motion-encoded macro blocks of non-I
frames.

Referring to Figures 2 and 5, although described as including multiple
functional circuit blocks, one may implement the encoder 30 and the decoder 62 in
hardware, software, or a combination of both. For example, designers often
implement the encoder 30 and decoder 62 with respective processors that perform
the respective functions of the above-described circuit blocks.

More detailed discussions of the MPEG encoder 30 and the MPEG decoder
62 of Figures 2 and 5, respectively, of motion encoding and decoding, and of the
MPEG standard in general are presented in many publications including "Video
Compression" by Peter D. Symes, McGraw-Hill, 1998, which is incorporated by
reference. Furthermore, other well-known block-based compression techniques are
available for encoding and decoding video frames and still images.

Discrete Cosine Transform and Inverse Discrete Cosine Transform

The 2-D DCT F(v, u) is given by the following equation:

1)  $$F(v,u) = \frac{1}{4}\,C(v)\,C(u) \sum_{y=0}^{7} \sum_{x=0}^{7} f(y,x) \cos\left(\frac{(2y+1)v\pi}{16}\right) \cos\left(\frac{(2x+1)u\pi}{16}\right)$$

    $$C(v) = \frac{1}{\sqrt{2}} \text{ for } v = 0, \quad C(v) = 1 \text{ otherwise}$$

    $$C(u) = \frac{1}{\sqrt{2}} \text{ for } u = 0, \quad C(u) = 1 \text{ otherwise}$$

where v is the row and u is the column of the corresponding transform block. For
example, if F(v, u) represents the block 37 (Figure 3) of transform values, then F(1,
3) = D13. Likewise, f(y, x) is the pixel value in row y, column x of the corresponding
pre-compression block. For example, if f(y, x) represents the block 22a (Figure 1B)
of pre-compression luminance values, then f(0, 0) = Y(0, 0)a. Thus, each transform
value F(v, u) depends on all of the pixel values f(y, x) in the corresponding
pre-compression block.
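
A direct, unoptimized evaluation of the 8 x 8 form of equation (1) may help make the indices concrete. The following Python sketch is an illustration only; the function names are hypothetical, and practical encoders use much faster factorizations.

```python
import math


def c(k):
    """Scale factor C(k): 1/sqrt(2) for k = 0, and 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct_2d(f):
    """Evaluate the 2-D DCT term by term for an 8x8 block f[y][x]."""
    F = [[0.0] * 8 for _ in range(8)]
    for v in range(8):                      # v is the row of the transform block
        for u in range(8):                  # u is the column of the transform block
            total = sum(f[y][x]
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        for y in range(8) for x in range(8))
            F[v][u] = 0.25 * c(v) * c(u) * total
    return F


flat = [[100] * 8 for _ in range(8)]        # a block in which every pixel equals 100
F = dct_2d(flat)
```

For this flat block only the DC value F[0][0] is non-zero (it equals 800, up to rounding), which shows why smooth image regions compress well.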

The 2-D matrix form of F(v, u) is given by the following equation:

2)  $$F(v,u) = f \cdot R_{vu}$$

where f is a 2-D matrix that includes the pixel values f(y, x), and Rvu is a 2-D matrix
that one can calculate from equation (1) and that is unique for each respective pair of
coordinates v and u.

The IDCT f(y, x), which is merely the inverse of the DCT F(v, u), is given by
the following equation:

3)  $$f(y,x) = \frac{1}{4} \sum_{v=0}^{7} \sum_{u=0}^{7} C(v)\,C(u)\,F(v,u) \cos\left(\frac{(2y+1)v\pi}{16}\right) \cos\left(\frac{(2x+1)u\pi}{16}\right)$$

    $$C(u) = \frac{1}{\sqrt{2}} \text{ for } u = 0, \quad C(u) = 1 \text{ otherwise}$$

    $$C(v) = \frac{1}{\sqrt{2}} \text{ for } v = 0, \quad C(v) = 1 \text{ otherwise}$$

where y is the row and x is the column of the inverse-transform block. For example,
if f(y, x) represents the block 70 (Figure 6) of inverse transform values, then f(7, 4) =
I74.
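
The inverse transform can be evaluated in the same direct way. The sketch below implements the 8 x 8 IDCT sum literally; the function names are hypothetical, and the two test blocks are made-up inputs chosen so that the results are easy to check by hand.

```python
import math


def c(k):
    """Scale factor C(k): 1/sqrt(2) for k = 0, and 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def idct_2d(F):
    """Evaluate the 2-D IDCT term by term for an 8x8 transform block F[v][u]."""
    f = [[0.0] * 8 for _ in range(8)]
    for y in range(8):                      # y is the row of the inverse-transform block
        for x in range(8):                  # x is the column of the inverse-transform block
            total = sum(c(v) * c(u) * F[v][u]
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        for v in range(8) for u in range(8))
            f[y][x] = 0.25 * total
    return f


dc_only = [[0.0] * 8 for _ in range(8)]
dc_only[0][0] = 800.0                       # only the DC transform value is non-zero
flat = idct_2d(dc_only)                     # every recovered value is about 100

one_ac = [[0.0] * 8 for _ in range(8)]
one_ac[0][1] = 1.0                          # a single horizontal-frequency value
wave = idct_2d(one_ac)                      # varies along x only
```

Each of the 64 outputs sums 64 products, so the direct form costs 4096 multiply-accumulate steps per block, which motivates the simplifications that follow.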

The 2-D matrix form of f(y, x) is given by the following equation:

4)  $$f(y,x) = F \cdot R_{yx}$$

where F is a 2-D matrix that includes the transform values F(v, u), and Ryx is a 2-D
matrix that one can calculate from equation (3) and that is unique for each respective
pair of coordinates y and x.

To simplify the 2-D IDCT of equations (3) and (4), one can represent each
respective row y of f(y, x) as a 1-D transform, and calculate f(y, x) as a series of 1-D
IDCTs. The 1-D IDCT is given by the following equation:

5)  $$f_v(x) = \frac{1}{2} \sum_{u=0}^{7} C(u)\,F_v(u) \cos\left(\frac{(2x+1)u\pi}{16}\right), \quad v = \text{row}$$



For example purposes, using the 1-D IDCT equation (5) to calculate the
inverse-transform values I00 - I77 of the block 70 (Figure 6) from the transform values
D00 - D77 of the block 37 (Figure 3) is discussed. The 8 x 8 matrices F and f that
respectively represent the 8 x 8 blocks 37 and 70 in mathematical form are given by
the following equations:

6)  $$F = F(v,u) = \begin{bmatrix} F_0(u) \\ F_1(u) \\ \vdots \\ F_7(u) \end{bmatrix} = \begin{bmatrix} D_{07} & D_{06} & D_{05} & D_{04} & D_{03} & D_{02} & D_{01} & D_{00} \\ D_{17} & D_{16} & D_{15} & D_{14} & D_{13} & D_{12} & D_{11} & D_{10} \\ \vdots & & & & & & & \vdots \\ D_{77} & D_{76} & D_{75} & D_{74} & D_{73} & D_{72} & D_{71} & D_{70} \end{bmatrix}$$

7)  $$f = f(y,x) = \begin{bmatrix} f_0(x) \\ f_1(x) \\ \vdots \\ f_7(x) \end{bmatrix} = \begin{bmatrix} I_{07} & I_{06} & I_{05} & I_{04} & I_{03} & I_{02} & I_{01} & I_{00} \\ I_{17} & I_{16} & I_{15} & I_{14} & I_{13} & I_{12} & I_{11} & I_{10} \\ \vdots & & & & & & & \vdots \\ I_{77} & I_{76} & I_{75} & I_{74} & I_{73} & I_{72} & I_{71} & I_{70} \end{bmatrix}$$

F0(u) - F7(u) are the rows of the matrix F and thus represent the respective rows of
the block 37, and f0(x) - f7(x) are the rows of the matrix f and thus represent the
respective rows of the block 70.

First, one calculates an intermediate 8 x 8 block of intermediate
inverse-transform values I', which are represented by the 1-D transform f'v(x), according to
the following equation, which is equation (5) in matrix form:

8)  $$f'_0(x) = R_{yv} \cdot F_0(u) = \begin{bmatrix} R_{07} & R_{06} & R_{05} & R_{04} & R_{03} & R_{02} & R_{01} & R_{00} \\ R_{17} & R_{16} & R_{15} & R_{14} & R_{13} & R_{12} & R_{11} & R_{10} \\ \vdots & & & & & & & \vdots \\ R_{77} & R_{76} & R_{75} & R_{74} & R_{73} & R_{72} & R_{71} & R_{70} \end{bmatrix} \cdot F_0(u)$$

    $$\vdots$$

    $$f'_7(x) = R_{yv} \cdot F_7(u)$$

Ryv is a 2-D matrix that one can calculate from equation (5) and that is unique for
each respective pair of coordinates y and v. Thus, the intermediate matrix f' is given
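by the following equation:

9)  $$f' = \begin{bmatrix} f'_0(x) \\ f'_1(x) \\ \vdots \\ f'_7(x) \end{bmatrix} = \begin{bmatrix} I'_{07} & I'_{06} & I'_{05} & I'_{04} & I'_{03} & I'_{02} & I'_{01} & I'_{00} \\ I'_{17} & I'_{16} & I'_{15} & I'_{14} & I'_{13} & I'_{12} & I'_{11} & I'_{10} \\ \vdots & & & & & & & \vdots \\ I'_{77} & I'_{76} & I'_{75} & I'_{74} & I'_{73} & I'_{72} & I'_{71} & I'_{70} \end{bmatrix}$$

To calculate the final matrix f of the inverse-transform values I00 - I77 of the
block 70 (Figure 6), one transposes the intermediate matrix f' to obtain f'^T, replaces
the transform rows F0(u) - F7(u) in equation (8) with the rows f'^T0(x) - f'^T7(x) of f'^T,
and then recalculates equation (8). f'^T is given by the following equation:

10) $$f'^{T} = \begin{bmatrix} I'_{07} & I'_{17} & I'_{27} & I'_{37} & I'_{47} & I'_{57} & I'_{67} & I'_{77} \\ I'_{06} & I'_{16} & I'_{26} & I'_{36} & I'_{46} & I'_{56} & I'_{66} & I'_{76} \\ \vdots & & & & & & & \vdots \\ I'_{00} & I'_{10} & I'_{20} & I'_{30} & I'_{40} & I'_{50} & I'_{60} & I'_{70} \end{bmatrix}$$
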

The subscript coordinates of the inverse-transform values I' in equation (10) are the
same as those in equation (9) to clearly show the transpose. That is, I'10 of equation
(10) equals I'10 of equation (9). Thus, to transpose a matrix, one merely
interchanges the rows and respective columns within the matrix. For example, the
first row of f' becomes the first column of f'^T, the second row of f' becomes the
second column of f'^T, and so on. The following equation shows the calculation of the
inverse-transform matrix f:

11) $$f_0(x) = R_{yv} \cdot f'^{T}_0(x) = \begin{bmatrix} R_{07} & R_{06} & R_{05} & R_{04} & R_{03} & R_{02} & R_{01} & R_{00} \\ R_{17} & R_{16} & R_{15} & R_{14} & R_{13} & R_{12} & R_{11} & R_{10} \\ \vdots & & & & & & & \vdots \\ R_{77} & R_{76} & R_{75} & R_{74} & R_{73} & R_{72} & R_{71} & R_{70} \end{bmatrix} \cdot f'^{T}_0(x)$$

    $$\vdots$$

    $$f_7(x) = R_{yv} \cdot f'^{T}_7(x)$$

Thus, equation (11) gives the inverse-transform values I00 - I77 of the block 70
(Figure 6).
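
The two-pass procedure can be checked numerically. The Python sketch below applies an 8-point 1-D IDCT to every row, transposes, applies the 1-D IDCT again, and compares the result with a direct evaluation of the 2-D IDCT. The names are hypothetical, the matrix R is built in natural index order (not the descending column order in which the matrices above are written), and the test block is made-up data. The sketch also makes one point explicit that the text does not discuss: the second pass leaves the result transposed, so one more transpose is needed to obtain f indexed as f[y][x].

```python
import math


def c(k):
    """Scale factor C(k): 1/sqrt(2) for k = 0, and 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


# R[x][u] is the weight of transform value u in output position x of the 1-D IDCT.
R = [[0.5 * c(u) * math.cos((2 * x + 1) * u * math.pi / 16) for u in range(8)]
     for x in range(8)]


def idct_1d(row):
    """8-point 1-D IDCT of one row: eight 8-point vector multiplications."""
    return [sum(R[x][u] * row[u] for u in range(8)) for x in range(8)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def idct_2d_two_pass(F):
    intermediate = [idct_1d(row) for row in F]                   # first pass over rows
    second = [idct_1d(row) for row in transpose(intermediate)]   # second pass
    # The second pass is indexed [x][y]; transpose once more to get f[y][x].
    return transpose(second)


def idct_2d_direct(F):
    return [[0.25 * sum(c(v) * c(u) * F[v][u]
                        * math.cos((2 * y + 1) * v * math.pi / 16)
                        * math.cos((2 * x + 1) * u * math.pi / 16)
                        for v in range(8) for u in range(8))
             for x in range(8)] for y in range(8)]


# Made-up, non-symmetric transform block.
F = [[((v * 8 + u) * 37) % 101 - 50 for u in range(8)] for v in range(8)]
max_diff = max(abs(a - b)
               for row_a, row_b in zip(idct_2d_two_pass(F), idct_2d_direct(F))
               for a, b in zip(row_a, row_b))
```

The two results agree to within floating-point rounding. The two-pass form needs 2 x 8 x 8 = 128 eight-point vector multiplications (1024 multiply-accumulate steps) instead of the 4096 steps of the direct form.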

Referring to equations (8) - (11), although splitting the 2-D IDCT into a series
of two 1-D IDCTs simplifies the mathematics, these equations still involve a large
number of 8-point-vector-by-8-point-vector multiplications for converting the 8 x 8
block 37 (Figure 3) of transform values into the 8 x 8 block 70 (Figure 6) of
inverse-transform values. For example, an 8-value matrix row times an 8-value matrix
column (e.g., equation (11)) is an 8-point-vector multiplication. Unfortunately,
processors typically require a relatively complex architecture to handle vector
multiplications of this size.

Masaki's IDCT Technique

As discussed in his paper, Masaki further simplifies the 1-D IDCT equations
(8) - (11) by breaking the 8-point-vector multiplications down into 4-point-vector
multiplications. This allows processors with relatively simple architectures to convert
the block 37 (Figure 3) of transform values into the block 70 (Figure 6) of
inverse-transform values.

The following equation gives the first row of even and odd Masaki values De
and Do from which one can calculate the first row of intermediate inverse-transform
values I'00 - I'07 from the matrix of equation (9):



12)

D00 - D07 are the values in the first row of the transform block 37, Me0 - Mef are the
even Masaki coefficients, and Mo0 - Mof are the odd Masaki coefficients. The values
of the even and odd Masaki coefficients are given in Masaki's paper, which is
heretofore incorporated by reference. One calculates the remaining rows of Masaki
values (seven, one for each remaining row of transform values in the block 37)
in a similar manner.

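One calculates the intermediate inverse-transform values I'00 - I'07 from the
even and odd Masaki values De and Do of equation (12) according to the following
equation:

13) $$\begin{bmatrix} I'_{00} \\ I'_{01} \\ I'_{02} \\ I'_{03} \end{bmatrix} = \tfrac{1}{2} P D_o + \tfrac{1}{2} Q D_e = \tfrac{1}{2}\left(P D_o + Q D_e\right)$$

    $$\begin{bmatrix} I'_{07} \\ I'_{06} \\ I'_{05} \\ I'_{04} \end{bmatrix} = \tfrac{1}{2} P D_o - \tfrac{1}{2} Q D_e = \tfrac{1}{2}\left(P D_o - Q D_e\right)$$

One calculates the remaining rows of intermediate inverse-transform values I' in a
similar manner.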
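
The idea of replacing one 8-point product with two 4-point products can be demonstrated from first principles. The Python sketch below splits the 8-point 1-D IDCT into an even part and an odd part by using the symmetry cos((2(7-n)+1)uπ/16) = (-1)^u cos((2n+1)uπ/16). The 4 x 4 matrices EVEN and ODD are computed here from the 1-D IDCT formula; they are not Masaki's published coefficient tables, the scaling is folded into the matrices, and the sign and naming conventions of equation (13) are not reproduced. All names and the sample row are hypothetical.

```python
import math


def c(k):
    """Scale factor C(k): 1/sqrt(2) for k = 0, and 1 otherwise."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


# R[n][u] is the weight of transform value u in output position n of the 1-D IDCT.
R = [[0.5 * c(u) * math.cos((2 * n + 1) * u * math.pi / 16) for u in range(8)]
     for n in range(8)]

# 4x4 matrices for outputs 0..3, acting on the even and odd inputs respectively.
EVEN = [[R[n][u] for u in (0, 2, 4, 6)] for n in range(4)]
ODD = [[R[n][u] for u in (1, 3, 5, 7)] for n in range(4)]


def mat_vec4(m, v):
    """4x4 matrix times 4-point vector: four 4-point vector multiplications."""
    return [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]


def idct_1d_direct(row):
    return [sum(R[n][u] * row[u] for u in range(8)) for n in range(8)]


def idct_1d_even_odd(row):
    e = mat_vec4(EVEN, row[0::2])          # contribution of inputs 0, 2, 4, 6
    o = mat_vec4(ODD, row[1::2])           # contribution of inputs 1, 3, 5, 7
    first_four = [e[n] + o[n] for n in range(4)]      # outputs 0, 1, 2, 3
    last_four = [e[n] - o[n] for n in range(4)]       # outputs 7, 6, 5, 4
    return first_four + last_four


row = [120, -31, 14, 7, -5, 3, 0, -2]      # made-up row of transform values
direct = idct_1d_direct(row)
split = idct_1d_even_odd(row)
```

The sums give outputs 0 to 3 and the differences give outputs 7 to 4, so the last four values come out in inverse order, which is the ordering issue that the following paragraphs address.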

Figure 7 is a block 82 of the values I' generated by the group of Masaki
equations represented by the equation (13). Accordingly, the last four values in each
row, i.e., I'y4 - I'y7, are in inverse order.

Referring to Figure 8, one generates a properly ordered block 84 of the values
I' by putting I'y4 - I'y7 in the proper order. Unfortunately, this reordering takes
significant processing time.

Next, referring to Figure 9, in a manner similar to that described above in
conjunction with equations (9) and (10), one calculates the final inverse-transform
values Iyx by transposing the block 84 (Figure 8) to generate a transposed block 86
and by replacing the row of transform values D00 - D07 in equation (12) with the
respective rows of the transposed block 86.

But referring to Figure 10, equation (12) requires one to separate the row of
transform values D into an even group D00, D02, D04, and D06 and an odd group D01,
D03, D05, and D07. Therefore, one must also separate the rows of intermediate
inverse-transform values I' into respective even groups I'y0, I'y2, I'y4, and I'y6 and odd
groups I'y1, I'y3, I'y5, and I'y7. Thus, one performs this even-odd separation on the
block 86 (Figure 9) to generate an even-odd separated block 88 of the intermediate
values I'. Replacing the row of transform values D00 - D07 in equation (12) with the
respective rows of the block 88, one generates intermediate Masaki vectors P'Do
and Q'De and generates the final inverse-transform values I according to the
following equation:

Referring to Figure 11, using equation (14) for each set of intermediate
Masaki vectors generates a block 90 in which the last four inverse-transform values
Iy4 - Iy7 in each row are in inverse order. Therefore, one generates the properly
ordered block 70 (Figure 6) by putting Iy4 - Iy7 in the proper order. Unfortunately, this
reordering takes significant processing time.

Therefore, although Masaki's technique may simplify the processor 
architecture by breaking down 8-point-vector multiplications into 4-point-vector 
multiplications, it typically requires more processing time than the 8-point technique 
due to Masaki's time-consuming block transpositions and rearrangements. 



SUMMARY OF THE INVENTION 

In one aspect of the invention, an image decoder includes a memory and a
processor coupled to the memory. The processor is operable to store a column of
intermediate values in the memory as a row of intermediate values, combine the
intermediate values within the stored row to generate a column of resulting values,
and store the resulting values in the memory as a row of resulting values.

Such an image decoder can store the Masaki values in a memory register
such that when the processor combines these values to generate the intermediate
inverse-transform values, it stores these values in a transposed fashion. Thus, such
an image decoder reduces the image-processing time by combining the generating
and transposing of the values I' into a single step.

In a related aspect of the invention, the intermediate values include a first
even-position even intermediate value, an odd-position even intermediate value, a
second even-position even intermediate value, a first even-position odd intermediate
value, an odd-position odd intermediate value, and a second even-position odd
intermediate value. The processor stores the first even-position even intermediate
value and the first even-position odd intermediate value in a first pair of adjacent
storage locations. The processor also stores the second even-position even
intermediate value and the second even-position odd intermediate value in a second
pair of adjacent storage locations, the second pair of storage locations being adjacent
to the first pair of storage locations.

Such an image decoder can store the Masaki values in a memory register
such that when the processor combines these values to generate the intermediate
inverse-transform values, it stores these values in a transposed and
even-odd-separated fashion. Thus, such an image decoder reduces the image-processing
time by combining the generating, transposing, and even-odd separating of the
values I' into a single step.

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1A is a diagram of a conventional macro block of pixels in an image.

Figure 1B is a diagram of a conventional block of pre-compression luminance
values that respectively correspond to the pixels in the macro block of Figure 1A.

Figures 1C and 1D are respective diagrams of conventional blocks of
pre-compression chroma values that respectively correspond to the pixel groups in the
macro block of Figure 1A.

Figure 2 is a block diagram of a conventional MPEG encoder.

Figure 3 is a block of transform values that the encoder of Figure 2 generates.

Figure 4 is a conventional zigzag scan pattern that the quantizer and zigzag
scanner of Figure 2 implements.

Figure 5 is a block diagram of a conventional MPEG decoder.

Figure 6 is a block of inverse transform values that the decoder of Figure 5
generates.

Figure 7 is a block of intermediate inverse-transform values according to
Masaki's technique.

Figure 8 is a block having the intermediate inverse-transform values of Figure
7 in sequentially ordered rows.

Figure 9 is a block having the intermediate inverse-transform values of Figure
8 in a transposed arrangement.

Figure 10 is a block having the intermediate inverse-transform values of
Figure 9 in an even-odd-separated arrangement.

Figure 11 is a block of final inverse-transform values according to Masaki's
technique.

Figure 12 is a block diagram of an image decoder according to an
embodiment of the invention.

Figure 13 is a block diagram of the processor of Figure 12 according to an
embodiment of the invention.

Figure 14A illustrates a pair-wise add operation that the processor of Figure
13 executes according to an embodiment of the invention.

Figure 14B illustrates a pair-wise subtract operation that the processor of
Figure 13 executes according to an embodiment of the invention.

Figure 15 illustrates a register map function that the processor of Figure 13
executes according to an embodiment of the invention.

Figure 16 illustrates a dual-4-point-vector-multiplication function that the
processor of Figure 13 executes according to an embodiment of the invention.

Figure 17 illustrates an implicit-matrix-transpose function that the processor of
Figure 13 executes according to an embodiment of the invention.

Figure 18 illustrates an implicit-matrix-transpose-and-even-odd-separate
function that the processor of Figure 13 executes according to an embodiment of the
invention.


DETAILED DESCRIPTION OF THE INVENTION

Figure 12 is a block diagram of an image decoder 100 according to an
embodiment of the invention. The decoder 100 significantly decreases Masaki's
IDCT time by calculating and transposing the intermediate inverse-transform values
I' in the same step as discussed below in conjunction with Figure 17. That is, the
decoder 100 generates the block 86 (Figure 9) of transposed values I' directly from
equation (13), and thus omits the generation of the blocks 82 (Figure 7) and 84
(Figure 8). The decoder 100 may further decrease Masaki's IDCT conversion time
by calculating, transposing, and even-odd separating the intermediate
inverse-transform values I' in the same step as discussed below in conjunction with Figure
18. That is, the decoder 100 generates the block 88 (Figure 10) of transposed and
even-odd-separated values I' directly from equation (13), and thus omits the
generation of the blocks 82, 84, and 86.

The decoder 100 includes an input buffer 102, a processor unit 104, and an 
optional frame buffer 106. The input buffer 102 receives and stores encoded data 
that represents one or more encoded images. The processor unit 104 includes a 
processor 108 for decoding the encoded image data and includes a memory 110. If 
the received encoded image data represents video frames, then the decoder 100 
includes the optional frame buffer 106 for storing the decoded frames from the 
processor unit 104 in the proper order for storage or display. 

Figure 13 is a block diagram of a computing unit 112 of the processor 108 
(Figure 12) according to an embodiment of the invention. The unit 112 includes two 
similar computing clusters 114a and 114b, which typically operate in parallel. For 
clarity, only the structure and operation of the cluster 114a is discussed, it being 
understood that the structure and operation of the cluster 114b are similar. 
Furthermore, the clusters 114a and 114b may include additional circuitry that is 
omitted from Figure 13 for clarity. 

In one embodiment, the cluster 114a includes an integer computing unit 
(I-unit) 116a and an integer, floating-point, graphics computing unit (IFG-unit) 118a. 
The I-unit 116a performs memory-load and memory-store operations and simple 
arithmetic operations on 32-bit integer data. The IFG-unit 118a operates on 64-bit 
data and can perform complex mathematical operations that are tailored for 
multimedia and 3-D graphics applications. The cluster 114a also includes a register 
file 120a, which includes thirty-two 64-bit registers Reg0 - Reg31. The I-unit 116a 
and IFG-unit 118a can access each of these registers as respective upper and lower 
32-bit partitions, and the IFG-unit 118a can also access each of these registers as a 
single 64-bit partition. The I-unit 116a receives data from the register file 120a via 
32-bit busses 124a and 126a and provides data to the register file 120a via a 32-bit 
bus 128a. Likewise, the IFG-unit 118a receives data from the register file 120a via 
64-bit busses 130a, 132a, and 134a and provides data to the register file 120a via a 
64-bit bus 136a. 

Still referring to Figure 13, in another embodiment, the cluster 114a includes a 
128-bit partitioned-long-constant (PLC) register 136a and a 128-bit 
partitioned-long-variable (PLV) register 138a. The PLC and PLV registers 136a and 
138a improve the computational throughput of the cluster 114a without significantly 
increasing its size. The registers 136a and 138a receive data from the register file 
120a via the busses 132a and 134a and provide data to the IFG-unit 118a via 128-bit 
busses 140a and 142a, respectively. Typically, the IFG-unit 118a operates on the 
data stored in the registers 136a and 138a during its execution of special multimedia 
instructions that cause the IFG-unit 118a to produce a 32- or 64-bit result and store 
the result in one of the registers Reg0 - Reg31. In addition, these special instructions 
may cause the register file 120a to modify the content of the register 138a. 

In one embodiment, there is no direct path between the memory 110 (Figure 
12) and the PLC and PLV registers 136a and 138a. Therefore, the cluster 114a 
initializes these registers from the register file 120a before the IFG-unit 118a 
operates on their contents. Although the additional clock cycles needed to initialize 
these registers may seem inefficient, many multimedia applications minimize this 
overhead by using the data stored in the registers 136a and 138a for several 
different operations before reloading these registers. Furthermore, some instructions 
cause the cluster 114a to update the PLV register 138a while executing another 
operation, thus eliminating the need for additional clock cycles to load or reload the 
register 138a. 

Figure 14A illustrates a pair-wise add operation that the cluster 114a of Figure 
13 can execute according to an embodiment of the invention. For example 
purposes, Reg0 of the register file 120a (Figure 13) stores four 16-bit values a - d, 
and Reg1 stores four 16-bit values e - h. The IFG-unit 118a adds the contents of 
the adjacent partitions of Reg0 and Reg1, respectively, and loads the resulting sums 
into respective 16-bit partitions of Reg2 in one clock cycle. Specifically, the unit 
118a adds a and b and loads the result a + b into the first 16-bit partition of Reg2. 
Similarly, the unit 118a adds c and d, e and f, and g and h, and loads the resulting 
sums c + d, e + f, and g + h into the second, third, and fourth partitions, respectively, 
of Reg2. Furthermore, the unit 118a may divide each of the resulting sums a + b, 
c + d, e + f, and g + h by two before storing them in the respective partitions of Reg2. 
The unit 118a right shifts each of the resulting sums by one bit to perform this 
division. 
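
For illustration only, the pair-wise add of Figure 14A can be modeled in software as 
sketched below. The function name pairwise_add, the use of Python lists for the four 
16-bit partitions, and the omission of overflow handling are assumptions of this 
sketch and are not taken from the described circuit. 

```python
def pairwise_add(reg0, reg1, halve=False):
    """Software model of the pair-wise add (hypothetical helper).

    reg0 holds [a, b, c, d] and reg1 holds [e, f, g, h].
    Returns the four partitions of Reg2: [a+b, c+d, e+f, g+h].
    When halve is True, each sum is right-shifted by one bit,
    which divides it by two. Overflow of a 16-bit partition is
    not modeled here.
    """
    values = list(reg0) + list(reg1)
    result = []
    for i in range(0, len(values), 2):
        total = values[i] + values[i + 1]   # add adjacent partitions
        if halve:
            total >>= 1                     # divide by two
        result.append(total)
    return result


print(pairwise_add([1, 2, 3, 4], [5, 6, 7, 8]))              # [3, 7, 11, 15]
print(pairwise_add([1, 2, 3, 4], [5, 6, 7, 8], halve=True))  # [1, 3, 5, 7]
```

The hardware performs all four additions in one clock cycle; the loop here only 
models the arithmetic, not the timing. 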

Figure 14B illustrates a pair-wise subtract operation that the cluster 114a of 
Figure 13 can execute according to an embodiment of the invention. Reg0 stores 
the four 16-bit values a - d, and Reg1 stores the four 16-bit values e - h. The 
IFG-unit 118a subtracts the contents of one partition from the contents of the adjacent 
partition and loads the resulting differences into the respective 16-bit partitions of 
Reg2 in one clock cycle. Specifically, the unit 118a subtracts b from a and loads the 
result a - b into the first 16-bit partition of Reg2. Similarly, the unit 118a subtracts d 
from c, f from e, and h from g, and loads the resulting differences c - d, e - f, 
and g - h into the second, third, and fourth partitions, respectively, of Reg2. 
Furthermore, the unit 118a may divide each of the resulting differences a - b, c - d, 
e - f, and g - h by two before storing them in the respective partitions of Reg2. The 
unit 118a right shifts each of the resulting differences by one bit to perform this 
division. 

Referring to Figures 14A and 14B, although Reg0, Reg1, and Reg2 are 
shown divided into 16-bit partitions, in other embodiments the IFG-unit 118a 
performs the pair-wise add and subtract operations on partitions having other sizes. 
For example, Reg0, Reg1, and Reg2 may be divided into eight 8-bit partitions, two 
32-bit partitions, or sixteen 4-bit partitions. In addition, the IFG-unit 118a may 
execute the pair-wise add and subtract operations using registers other than Reg0, 
Reg1, and Reg2. 
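
The following sketch models the pair-wise add and subtract on packed 64-bit 
register values with a selectable partition width. It is a hedged illustration: the 
helper names pack, unpack, and pairwise are hypothetical, and two details that the 
text does not specify are assumed here, namely that partition 0 occupies the most 
significant bits and that each result wraps to the partition width. 

```python
import operator


def unpack(reg, width, bits=64):
    """Split a register value into partitions (assumed most significant first)."""
    count = bits // width
    mask = (1 << width) - 1
    return [(reg >> (width * (count - 1 - i))) & mask for i in range(count)]


def pack(parts, width):
    """Join partitions into one register value, wrapping each to `width` bits."""
    mask = (1 << width) - 1
    reg = 0
    for part in parts:
        reg = (reg << width) | (part & mask)
    return reg


def pairwise(reg0, reg1, width, op, bits=64):
    """Apply op to each adjacent pair of partitions of reg0, then of reg1."""
    parts = unpack(reg0, width, bits) + unpack(reg1, width, bits)
    out = [op(parts[i], parts[i + 1]) for i in range(0, len(parts), 2)]
    return pack(out, width)


# Four 16-bit partitions per register, as in Figures 14A and 14B.
r0 = pack([10, 4, 9, 3], 16)
r1 = pack([8, 2, 7, 1], 16)
print(unpack(pairwise(r0, r1, 16, operator.add), 16))  # [14, 12, 10, 8]
print(unpack(pairwise(r0, r1, 16, operator.sub), 16))  # [6, 6, 6, 6]
```

Passing a width of 8, 32, or 4 models the other partition sizes mentioned above 
without changing the pair-wise logic. 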

As discussed below in conjunction with Figures 16-18, the pair-wise add and 
subtract and divide-by-two features allow the IFG-unit 118a to calculate the 
intermediate and final inverse-transform values I' and I from the Masaki values as 
shown in equations (13) and (14). 

Figure 15 illustrates a map operation that the cluster 114a of Figure 13 can 
execute according to an embodiment of the invention. For example, a source 
register Reg0 is divided into eight 8-bit partitions 0-7 and contains the data that the 
cluster 114a is to map into a destination register Reg1, which is also divided into 
eight 8-bit partitions 0-7. A 32-bit partition of a control register Reg2 (only one 32-bit 
partition shown for clarity) is divided into eight 4-bit partitions 0-7 and contains 
identification values that control the mapping of the data from the source register 
Reg0 to the destination register Reg1. Specifically, each partition of the control 
register Reg2 corresponds to a respective partition of the destination register Reg1 
and includes a respective identification value that identifies the partition of the source 
register Reg0 from which the respective partition of the destination register Reg1 is 
to receive data. For example, the partition 0 of the control register Reg2 
corresponds to the partition 0 of the destination register Reg1 and contains the 
identification value "2". Therefore, the cluster 114a loads the contents of the partition 2 
of the source register Reg0 into the partition 0 of the destination register Reg1 as 
indicated by the respective pointer between these two partitions. Likewise, the 
partition 1 of the control register Reg2 corresponds to the partition 1 of the destination 
register Reg1 and contains the identification value "5". Therefore, the cluster 114a loads 
the contents of the partition 5 of the source register Reg0 into the partition 1 of the 
destination register Reg1. The cluster 114a can also load the contents of one of the 
source partitions into multiple destination partitions. For example, the partitions 3 
and 4 of the control register Reg2 both include the identification value "6". 
Therefore, the cluster 114a loads the contents of the partition 6 of the source register 
Reg0 into the partitions 3 and 4 of the destination register Reg1. In addition, the 
cluster 114a may leave a source partition unmapped, i.e., load its contents into none 
of the destination partitions. For example, none of the partitions of the control register 
Reg2 contains the identification value "7". Thus, the cluster 114a does not load the 
contents of the partition 7 of the source register Reg0 into a partition of the 
destination register Reg1. 
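
A minimal software model of this map operation is sketched below. The function 
name map_partitions and the list representation are assumptions. The control 
values for the destination partitions 0, 1, 3, and 4 follow the example above; the 
values chosen for the partitions 2, 5, 6, and 7 are hypothetical, since the text does 
not give them. 

```python
def map_partitions(source, control):
    """Model of the map operation (hypothetical helper).

    Destination partition i receives the contents of the source
    partition whose number is stored in control partition i.
    """
    return [source[index] for index in control]


source = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]

# Partitions 0, 1, 3, 4 are from the example in the text;
# partitions 2, 5, 6, 7 are placeholder values for this sketch.
control = [2, 5, 0, 6, 6, 1, 3, 4]

dest = map_partitions(source, control)
print(dest)  # ['s2', 's5', 's0', 's6', 's6', 's1', 's3', 's4']

# The same operation can even-odd separate a row of eight values.
print(map_partitions(source, [0, 2, 4, 6, 1, 3, 5, 7]))
```

Because each destination partition independently names its source, duplication 
(source partition 6 appears twice) and omission (source partition 7 never appears) 
both fall out of the same lookup. 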

As discussed below in conjunction with Figures 17 - 18, the cluster 114a 
performs the map operation to reorder the inverse-transform values I in the block 90 
(Figure 11) to obtain the block 70 (Figure 3). 

Figure 16 illustrates a 4-point-vector-product operation that the cluster 114a 
(Figure 13) can execute according to an embodiment of the invention. The cluster 
114a loads two 4-point vectors from the register file 120a into the PLC register 136a 
and two 4-point vectors into the PLV register 138a, where each vector value is 16 
bits. For example, during a first clock cycle, the cluster 114a loads the even-odd 
separated first row of transform values D00, D02, D04, D06, D01, D03, D05, and D07 in the 
block 37 (Figure 3) into the PLC register 136a as shown. During a second clock 
cycle, the cluster 114a loads the first row of Masaki's four 16-bit even constants 
(equation (12)) and the first row of Masaki's four 16-bit odd constants into the PLV 
register 138a as shown. During a third clock cycle, the IFG-unit 118a multiplies the 
contents of each corresponding pair of partitions of the registers 136a and 138a, 
adds the respective products, and loads the results into a 32-bit partition of Reg0 
(only one 32-bit partition shown for clarity). That is, the unit 118a multiplies D00 by 
Me3, D02 by Me2, D04 by Me1, D06 by Me0, D01 by Mo3, D03 by Mo2, D05 by Mo1, and D07 
by Mo0, sums the products D00 x Me3, D02 x Me2, D04 x Me1, and D06 x Me0 to generate 
the even Masaki value de00, sums the products D01 x Mo3, D03 x Mo2, D05 x Mo1, and 
D07 x Mo0 to generate the odd Masaki value do00, and loads de00 and do00 into 
respective halves of the 32-bit partition of Reg0. As discussed below in conjunction 
with Figures 17 and 18, the unit 118a can use the pair-wise add and subtract and the 
divide-by-two operations (Figures 14A - 14B) on Reg0 to generate the 
intermediate inverse-transform values I'00 and I'07 of equation (13). 
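
The arithmetic of this operation can be sketched as follows. The function names 
dual_dot4 and half_sum_and_difference are hypothetical, the numeric inputs are 
placeholders rather than actual transform values or Masaki constants, and the 
half sum and half difference follow the pairing stated above (de00 and do00 yield 
I'00 and I'07). 

```python
def dual_dot4(plc, plv):
    """Model of the dual 4-point vector product (hypothetical helper).

    plc holds eight values: the even-indexed transform values of a row
    followed by the odd-indexed ones. plv holds the four even constants
    followed by the four odd constants, in matching order.
    Returns (de, do), the even and odd Masaki values.
    """
    de = sum(d * m for d, m in zip(plc[:4], plv[:4]))
    do = sum(d * m for d, m in zip(plc[4:], plv[4:]))
    return de, do


def half_sum_and_difference(de, do):
    """Pair-wise add and subtract, each followed by a one-bit right shift."""
    return (de + do) >> 1, (de - do) >> 1


# Placeholder data: a row in even-odd separated order and made-up constants.
row = [1, 2, 3, 4, 5, 6, 7, 8]
constants = [1, 1, 1, 1, 2, 2, 2, 2]

de, do = dual_dot4(row, constants)
print(de, do)                           # 10 52
print(half_sum_and_difference(de, do))  # (31, -21)
```

Python's right shift on a negative integer rounds toward negative infinity; 
whether the hardware shift behaves the same way is not stated in the text. 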

Referring to Figures 13 and 16, because both clusters 114a and 114b can 
simultaneously perform four 4-point-vector-product operations, the computing unit 
112 can calculate QDe and PDo (equation (13)) for two rows of the transform values 
D (block 37 of Figure 3) in five clock cycles according to an embodiment of the 
invention. During the first clock cycle, the clusters 114a and 114b respectively load 
the first even-odd separated row of transform values D into the PLC register 136a 
and the second even-odd separated row of transform values into the PLC register 
136b. (The processor 108 even-odd separates the transform values using the map 
operation or as discussed below.) During the second cycle, the clusters 114a and 
114b load the first rows of the even and odd Masaki constants (Me0 - Me3 and Mo0 - 
Mo3) into the PLV registers 138a and 138b, respectively, and respectively calculate 
de00 and do00 and de10 and do10 as discussed above. During the third cycle, the 
clusters 114a and 114b load the second rows of the even and odd Masaki constants 
(Me4 - Me7 and Mo4 - Mo7) into the PLV registers 138a and 138b, respectively, and 
respectively calculate de01 and do01 and de11 and do11. During the fourth cycle, the 
clusters 114a and 114b load the third rows of the even and odd Masaki constants 
(Me8 - Meb and Mo8 - Mob) into the PLV registers 138a and 138b, respectively, and 
respectively calculate de02 and do02 and de12 and do12. And during the fifth cycle, the 
clusters 114a and 114b load the fourth rows of the even and odd Masaki constants 
(Mec - Mef and Moc - Mof) into the PLV registers 138a and 138b, respectively, and 
respectively calculate de03 and do03 and de13 and do13. Thus, the computing unit 112 
can calculate QDe and PDo significantly faster than prior processing circuits such as 
the one described by Masaki. 
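
The five-cycle schedule above can be traced with the following sketch. It models 
only the sequence of loads and products, not the hardware; the function name 
masaki_values_two_rows, the dictionary keys "a" and "b" for the two clusters, 
and the identity-matrix constants in the example are all assumptions made for 
illustration. 

```python
def dot4(a, b):
    """4-point vector product."""
    return sum(x * y for x, y in zip(a, b))


def masaki_values_two_rows(row_a, row_b, even_consts, odd_consts):
    """Trace the schedule for two even-odd separated rows (hypothetical helper).

    even_consts and odd_consts each hold four rows of four constants.
    Returns (results, cycles), where results[cluster] is a list of
    (de, do) pairs, one per constant row.
    """
    cycles = 1                            # cycle 1: load both PLC registers
    plc = {"a": row_a, "b": row_b}
    results = {"a": [], "b": []}
    for k in range(4):                    # cycles 2 through 5
        cycles += 1
        plv = even_consts[k] + odd_consts[k]   # same constants in both clusters
        for cluster in ("a", "b"):
            de = dot4(plc[cluster][:4], plv[:4])
            do = dot4(plc[cluster][4:], plv[4:])
            results[cluster].append((de, do))
    return results, cycles


# Placeholder constants: identity rows, so each product picks out one input.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
results, cycles = masaki_values_two_rows(
    [1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1], identity, identity)
print(cycles)        # 5
print(results["a"])  # [(1, 5), (2, 6), (3, 7), (4, 8)]
```

The PLC contents are loaded once and reused for all four constant rows, which is 
the reuse pattern that the text credits with keeping the initialization overhead low. 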

In one embodiment, to save processing time during the calculation of QDe and 
PDo, the processor 108 (Figure 12) even-odd separates the rows of the transform 
block 37 (Figure 3) for conformance with equation (12) during the inverse zigzag 
scan of the image data. For example, the processor 108 stores the first transform 
row in even-odd separated order, i.e., D00, D02, D04, D06, D01, D03, D05, and D07, as it 
reads this row from the input buffer 102. Thus, the processor 108 implements an 
inverse zigzag scan that stores the rows of the block 37 in even-odd-separated 
order. Since the processor 108 performs the inverse zigzag scan anyway, this 
even-odd-separation technique adds no additional processing time. Conversely, 
execution of the map operation does add processing time. 
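
One way to fold the even-odd separation into the inverse zigzag scan is sketched 
below. The text does not specify the scan pattern, so this sketch assumes the 
conventional 8 x 8 zigzag order used by JPEG and MPEG; the function names 
zigzag_order and inverse_zigzag_even_odd are hypothetical. 

```python
EVEN_ODD = [0, 2, 4, 6, 1, 3, 5, 7]   # column stored at each position of a row


def zigzag_order(n=8):
    """Return the (row, column) positions in conventional zigzag order."""
    order = []
    for s in range(2 * n - 1):        # s = row + column of each anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals run from bottom-left to top-right, odd ones the reverse.
        order.extend(diagonal[::-1] if s % 2 == 0 else diagonal)
    return order


def inverse_zigzag_even_odd(coeffs):
    """Place zigzag-ordered coefficients into an 8 x 8 block whose rows
    are stored in even-odd separated column order."""
    position = {column: i for i, column in enumerate(EVEN_ODD)}
    block = [[0] * 8 for _ in range(8)]
    for value, (r, c) in zip(coeffs, zigzag_order(8)):
        block[r][position[c]] = value  # remap the column while scanning
    return block


# Label each coefficient 10*row + column so the reordering is visible.
coeffs = [10 * r + c for r, c in zigzag_order(8)]
block = inverse_zigzag_even_odd(coeffs)
print(block[0])  # [0, 2, 4, 6, 1, 3, 5, 7]
print(block[1])  # [10, 12, 14, 16, 11, 13, 15, 17]
```

The column remapping is a table lookup performed at a store that the scan makes 
anyway, which is why the separation costs no extra pass over the data. 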

Figures 17 and 18 illustrate techniques for storing the Masaki values such 
that the computing unit 112 generates the transposed block 86 (Figure 9) or the 
transposed and even-odd separated block 88 (Figure 10) directly from the pair-wise 
add and subtract and divide-by-two operations that the unit 112 performs on the 
Masaki values. Thus, these techniques save significant processing time as 
compared to prior techniques that perform the re-ordering (blocks 82 and 84 of 
Figures 7 and 8, respectively), transposing, and even-odd separating as separate 
steps. 

Figure 17 illustrates an implicit block transpose that the computing unit 112 
performs according to an embodiment of the invention. As discussed above, this 
implicit transpose allows the unit 112 to generate the transposed block 86 (Figure 9) 
of values I' directly from the pair-wise add and subtract and the divide-by-two 
operations (equations (13) and (14)). The brackets represent 64-bit registers of the 
register file 120a, and the parentheses represent respective 32-bit partitions of these 
registers. Furthermore, the dual subscripts of the Masaki values indicate their 
position within their own row and identify the row of transform values D from which 
they were generated. For example, de00 is the first even Masaki value in the row of 
Masaki values, i.e., QDe, that were generated from the first row of transform values 
D00 - D07 of the block 37 (Figure 3). Similarly, de10 is the first even Masaki value in 
the row of Masaki values that were generated from the second row of transform 
values D10 - D17 of the block 37. 

Still referring to Figure 17, the computing unit 112 implicitly generates the 
transposed block 86 (Figure 9) by storing the combinations of de and do generated 
by the 4-point-vector-product operation in the proper 32-bit partitions of the registers 
Reg. Specifically, as discussed above in conjunction with Figure 16, the clusters 
114a and 114b store corresponding pairs of de and do in respective 32-bit register 
partitions. The half sum (generated by the pair-wise add and divide-by-two 
operations) of a pair produces one intermediate or final inverse-transform value, and 
the half difference (generated by the pair-wise subtract and divide-by-two operations) 
of the same pair produces another intermediate or final inverse-transform value. For 
example, the unit 112 stores do00 and de00 in a 32-bit partition 170 of a register Reg0 
and stores do10 and de10 in a second partition 172 of Reg0. Thus, their 
respective half sums generate I'00 and I'10, and their respective half differences 
generate I'07 and I'17. Referring to Figure 9, these are the first and second values I' in 
the first and last rows, respectively, of the transposed block 86. Because it is 
desired to store values in the same row in the same registers, the unit 112 stores I'00 
and I'10 in a partition 174 of a register Reg1 and stores I'07 and I'17 in a partition 176 
of a register Reg2. The unit 112 loads the other pairs of de and do into the partitions 
as shown, and performs the pair-wise add and subtract and divide-by-two operations 
to store the resulting intermediate inverse-transform values I' in respective registers 
as shown. Therefore, the unit 112 stores each half row of the transposed block 86 in 
a respective register. For example, the first half of the first row of the block 86, i.e., 
I'00 - I'30, is stored in Reg1. Likewise, the last half of this first row, i.e., I'40 - I'70, is 
stored in a register Reg3. Thus, the unit 112 effectively transposes the block 84 
(Figure 8) to generate the block 86 during the same cycles that it generates the 
values I'. Because the unit 112 calculates and stores the values I' anyway, the unit 
112 performs the implicit transpose with no additional cycles. 
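
The implicit transpose can be illustrated with the sketch below, which writes each 
half sum and half difference straight into its transposed position instead of 
building the untransposed block first. The function name implicit_transpose and 
the placeholder input values are hypothetical. The text states only that the first 
pair of a row yields the values with second subscripts 0 and 7; that the pair k 
yields the subscripts k and 7 - k in general is an assumption of this sketch. 

```python
def implicit_transpose(de, do):
    """Build the transposed block directly from the Masaki values.

    de[r][k] and do[r][k] are the k-th even and odd Masaki values
    generated from row r of the transform block. In the result,
    transposed[c][r] holds the value I' with subscripts r and c.
    """
    rows = len(de)        # eight transform rows in the text
    half = len(de[0])     # four Masaki pairs per row
    cols = 2 * half
    transposed = [[None] * rows for _ in range(cols)]
    for r in range(rows):
        for k in range(half):
            # Half sum lands in row k, half difference in row 7 - k.
            transposed[k][r] = (de[r][k] + do[r][k]) >> 1
            transposed[cols - 1 - k][r] = (de[r][k] - do[r][k]) >> 1
    return transposed


# Placeholder Masaki values, chosen only to make the layout easy to check.
de = [[4 * (r + 1) * (k + 1) for k in range(4)] for r in range(8)]
do = [[2 * (r + k) for k in range(4)] for r in range(8)]

block = implicit_transpose(de, do)
print(block[0][:2])  # [2, 5]
print(block[7][:2])  # [2, 3]
```

The only difference from a non-transposing version is the index used at the store, 
so no separate transpose pass is needed. 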

Next, the computing unit 112 executes the map operation to even-odd 
separate the rows of the block 86 (Figure 9) and thus generate the transposed 
even-odd-separated block 88 (Figure 10). 

Figure 18 illustrates an implicit block transpose and even-odd separation that 
the computing unit 112 performs according to an embodiment of the invention. This 
implicit transpose and even-odd separation allows the unit 112 to generate the 
transposed and even-odd separated block 88 (Figure 10) of values I' directly from 
the pair-wise add and subtract and the divide-by-two operations (equations (13) and 
(14)). 

Specifically, the technique described in conjunction with Figure 18 is similar to 
the technique described above in conjunction with Figure 17 except that the Masaki 
values are stored in a different order than they are in Figure 17. For example, the 
unit 112 stores do00 and de00 in the 32-bit partition 170 of Reg0 and stores do20 and 
de20 in the second partition 172 of Reg0. Thus, their respective half sums generate 
I'00 and I'20, and their respective half differences generate I'07 and I'27. Referring to 
Figure 10, these are the first and second values I' in the first and last rows, 
respectively, of the transposed block 88. Because it is desired to store values in the 
same row in the same registers, the unit 112 stores I'00 and I'20 in the partition 174 of 
Reg1 and stores I'07 and I'27 in the partition 176 of Reg2. The unit 112 loads the 
other pairs of de and do into the partitions as shown, and performs the pair-wise add 
and subtract and divide-by-two operations to store the resulting intermediate 
inverse-transform values I' in respective registers as shown. Therefore, the unit 112 
stores each half row of the transposed block 88 in a respective register. For example, 
the first half of the first row of the block 88, i.e., I'00, I'20, I'40, and I'60, is stored in 
Reg1. Likewise, the last half of this first row, i.e., I'10, I'30, I'50, and I'70, is stored 
in Reg3. Thus, the unit 112 effectively transposes and even-odd separates the block 
84 (Figure 8) to generate the block 88 during the same cycles that it generates the 
values I'. Because the unit 112 calculates and stores the values I' anyway, the unit 
112 performs the implicit transposing and even-odd separating with no additional 
cycles. 
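
The storage order described for Figure 18 can be sketched as follows. The function 
name implicit_transpose_even_odd, the row_order parameter, and the placeholder 
input values are hypothetical; as an assumption of this sketch, the pair k of a 
row is taken to yield the values with second subscripts k and 7 - k. 

```python
EVEN_ODD = (0, 2, 4, 6, 1, 3, 5, 7)   # order in which source rows are stored


def implicit_transpose_even_odd(de, do, row_order=EVEN_ODD):
    """Build the transposed, even-odd separated block from the Masaki values.

    de[r][k] and do[r][k] are the k-th even and odd Masaki values from
    row r of the transform block. Each row of the result lists the
    values from source rows 0, 2, 4, 6 and then 1, 3, 5, 7.
    """
    half = len(de[0])
    cols = 2 * half
    out = [[None] * len(row_order) for _ in range(cols)]
    for slot, r in enumerate(row_order):   # slot = position within output row
        for k in range(half):
            out[k][slot] = (de[r][k] + do[r][k]) >> 1
            out[cols - 1 - k][slot] = (de[r][k] - do[r][k]) >> 1
    return out


# Placeholder Masaki values, chosen only to make the layout easy to check.
de = [[4 * (r + 1) * (k + 1) for k in range(4)] for r in range(8)]
do = [[2 * (r + k) for k in range(4)] for r in range(8)]

block = implicit_transpose_even_odd(de, do)
print(block[0])  # [2, 8, 14, 20, 5, 11, 17, 23]
print(block[7])  # [2, 4, 6, 8, 3, 5, 7, 9]
```

Passing row_order=range(8) reproduces the plain transposed layout of Figure 17, 
which shows that the two techniques differ only in where each pair is stored. 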

Referring to Figures 17 and 18, after the computing unit 112 (Figure 13) 
generates the block 88 (Figure 10), it replaces the rows of values D in equation (12) 
with the rows of the block 88, and generates the block 90 (Figure 11) of final 
inverse-transform values in accordance with equation (14). The unit 112 then executes 
the map operation to re-order the rows of the block 90 to generate the rows of the block 
37 (Figure 3). The processor 108 (Figure 12) then stores the block 37 with the other 
decoded blocks of the image being decoded. 

From the foregoing it will be appreciated that, although specific embodiments 
of the invention have been described herein for purposes of illustration, various 
modifications may be made without deviating from the spirit and scope of the 
invention. For example, the above-described techniques may be used to speed up a 
DCT. 